## This code cell will not be shown in the HTML version of this notebook
# run animator for two-class classification fits
csvname = datapath + '2eggs_data.csv'
demo = nonlib.classification_basis_comparison_3d.Visualizer(csvname)
# run animator
demo.brows_single_fits(num_units = [v for v in range(0,30,1)], basis = 'poly',view = [30,-80])
This sort of trend holds for multiclass classification (and unsupervised learning problems) as well, as illustrated in the example below. Here we have tuned $100$ single layer $\text{tanh}$ neural network units, minimizing the multiclass softmax cost, to fit a toy $C=3$ class dataset. As you move the slider from left to right, weights from later and later in a run of $10,000$ gradient descent steps are used to form the fit.
## This code cell will not be shown in the HTML version of this notebook
# load in dataset
csvname = datapath + '3_layercake_data.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:-1,:]
y = data[-1:,:]
# import the v1 library
mylib8 = nonlib.library_v1.superlearn_setup.Setup(x,y)
# choose features
mylib8.choose_features(name = 'multilayer_perceptron',layer_sizes = [2,100,3],activation = 'tanh')
# choose normalizer
mylib8.choose_normalizer(name = 'standard')
# choose cost
mylib8.choose_cost(name = 'multiclass_softmax')
# fit an optimization
mylib8.fit(max_its = 10000,alpha_choice = 10**(-1))
# plot cost history
mylib8.show_histories(start = 10)
# load up animator
demo8 = nonlib.run_animators.Visualizer(csvname)
# pluck out a sample of the weight history
num_frames = 30 # how many evenly spaced weights from the history to animate
# animate based on the sample weight history
demo8.multiclass_animator(mylib8,num_frames,scatter = 'points',show_history = True)
This same phenomenon holds if we perform any other sort of learning - like classification. Below we use a set of stumps, trained via gradient descent, to perform two-class classification on a realistic dataset reminiscent of the 'perfect' three dimensional classification dataset shown in the previous Subsection. As you pull the slider from left to right the tree-based model employs weights from further along the optimization run, and the fit improves.
## This code cell will not be shown in the HTML version of this notebook
import copy
import sys
sys.path.append('../../')
from mlrefined_libraries import nonlinear_superlearn_library as nonlib
import autograd.numpy as np
datapath = '../../mlrefined_datasets/nonlinear_superlearn_datasets/'
# load in data
csvname = datapath + '2eggs_data.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:-1,:]
y = data[-1:,:]
# import the v1 library
mylib7 = nonlib.library_v1.superlearn_setup.Setup(x,y)
# choose features
mylib7.choose_features(name = 'stumps')
# choose normalizer
mylib7.choose_normalizer(name = 'none')
# choose cost
mylib7.choose_cost(name = 'softmax')
# fit an optimization
mylib7.fit(max_its = 5000,alpha_choice = 10**(-2))
# plot
demo7 = nonlib.run_animators.Visualizer(datapath + '2eggs_data.csv')
frames = 10
demo7.animate_static_N2_simple(mylib7,frames,show_history = False,scatter = 'on',view = [30,-50])
This same problem presents itself with all real supervised / unsupervised learning datasets. For example, if we take the two-class classification dataset shown in the third example of this Subsection and more completely tune the parameters of the same set of stumps, we learn a model that - while fitting the training data we currently have even better than before - is far too flexible for future test data. Moving the slider one notch to the right shows the result of a (nearly completely) optimized set of stumps trained on this dataset, with the resulting fit being extremely nonlinear (far too nonlinear for the phenomenon at hand).
## This code cell will not be shown in the HTML version of this notebook
# load in data
csvname = datapath + '2eggs_data.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:-1,:]
y = data[-1:,:]
# import the v1 library
mylib12 = nonlib.library_v1.superlearn_setup.Setup(x,y)
# choose features
mylib12.choose_features(name = 'stumps')
# choose normalizer
mylib12.choose_normalizer(name = 'none')
# choose cost
mylib12.choose_cost(name = 'softmax')
# fit an optimization
mylib12.fit(optimizer = 'newtons method',max_its = 1)
# plot
demo12 = nonlib.run_animators.Visualizer(datapath + '2eggs_data.csv')
frames = 2
demo12.animate_static_N2_simple(mylib12,frames,show_history = False,scatter = 'on',view = [30,-50])
Below we repeat the experiment above, only here we use $50$ stump units, tuning them to the data with $5000$ gradient descent steps. Once again each slider position shows the fit produced by a particular step of the run (marked on the cost function history); moving the slider from left to right advances the run, and the fit improves.
## This code cell will not be shown in the HTML version of this notebook
# load in dataset
csvname = datapath + 'universal_regression_samples_0.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:-1,:]
y = data[-1:,:]
# import the v1 library
mylib6 = nonlib.library_v1.superlearn_setup.Setup(x,y)
# choose features
mylib6.choose_features(name = 'stumps')
# choose normalizer
mylib6.choose_normalizer(name = 'none')
# choose cost
mylib6.choose_cost(name = 'least_squares')
# fit an optimization
mylib6.fit(max_its = 5000,alpha_choice = 10**(-2))
# load up animator
demo6 = nonlib.run_animators.Visualizer(csvname)
# pluck out a sample of the weight history
num_frames = 100 # how many evenly spaced weights from the history to animate
# animate based on the sample weight history
demo6.animate_1d_regression(mylib6,num_frames,scatter = 'points',show_history = True)
Another classic sub-family of kernel universal approximators consists of sine waves of increasing frequency - for example, with frequency increasing by an integer factor, as in
$$f_1(x) = \text{sin}(x), ~~ f_2(x) = \text{sin}(2x), ~~ f_3(x) = \text{sin}(3x), ...$$where the $m^{th}$ element is given as $f_m(x) = \text{sin}(mx)$.
Below we plot the table of values for the first four of these catalog functions using their equations.
## This code cell will not be shown in the HTML version of this notebook
# build the first 4 sinusoidal basis elements
import matplotlib.pyplot as plt
x = np.linspace(-5,5,100)
fig = plt.figure(figsize = (10,3))
for m in range(1,5):
    # make basis element
    fm = np.sin(m*x)
    fm_table = np.stack((x,fm),axis = 1)

    # plot the current element
    ax = fig.add_subplot(1,4,m)
    ax.plot(fm_table[:,0],fm_table[:,1],color = [0,1/float(m),m/float(m+1)],linewidth = 3)
    ax.set_title('$f_' + str(m) + '(x) = \\sin(' + str(m) + 'x)$',fontsize = 18)

    # clean up plot
    ax.grid(True, which='both')
    ax.axhline(y=0, color='k')
    ax.axvline(x=0, color='k')
plt.show()
As with the polynomials, notice how each of these catalog elements is fixed. They have no tunable parameters inside: the third element always looks like $f_3(x) = \text{sin}(3x)$ - that is, it always takes on that shape. Also note that, as with polynomials, to generalize this catalog of functions to higher dimensional input we pass each coordinate through the single dimensional version of the function separately. So in the case of $N=2$ inputs the functions take the form
\begin{equation} f_1(x_1,x_2) = \text{sin}(x_1), ~~ f_2(x_1,x_2) = \text{sin}(2x_1)\text{sin}(5x_2), ~~ f_3(x_1,x_2) = \text{sin}(4x_1)\text{sin}(2x_2), ~~ f_4(x_1,x_2) = \text{sin}(7x_1)\text{sin}(x_2), ~~ ... \end{equation}These are listed in no particular order; in general we can write a catalog element as $f_m(x_1,x_2) = \text{sin}(px_1)\text{sin}(qx_2)$ where $p$ and $q$ are any nonnegative integers.
We describe the kernel family in significantly more detail in Chapter 15.
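As a quick numeric check of the two-input construction above, here is a minimal sketch in plain NumPy (the function name is our own, not part of any library used in this notebook) that builds a product element $f(x_1,x_2) = \text{sin}(px_1)\text{sin}(qx_2)$ and evaluates it at a point where both factors equal $1$:

```python
import numpy as np

def sine_element_2d(p, q):
    # catalog element f(x1,x2) = sin(p*x1)*sin(q*x2) - a fixed shape with no tunable parameters
    return lambda x_1, x_2: np.sin(p * x_1) * np.sin(q * x_2)

# mirror the second element listed above: sin(2*x1)*sin(5*x2)
f = sine_element_2d(2, 5)
print(f(np.pi / 4, np.pi / 10))  # sin(pi/2)*sin(pi/2) = 1.0
```

Because the element has no internal weights, tuning a model built from such a catalog means tuning only the linear combination weights placed on each element.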
Choosing another elementary function gives another sub-catalog of single-layer neural network functions. The rectified linear unit (or 'relu' for short) is another popular example, elements of which (for single dimensional input) look like
\begin{equation} f_1(x) = \text{max}\left(0,w_{1,0} + w_{1,1}x\right), ~~ f_2(x) = \text{max}\left(0,w_{2,0} + w_{2,1}x\right), ~~ f_3(x) = \text{max}\left(0,w_{3,0} + w_{3,1}x\right), ~~ f_4(x) = \text{max}\left(0,w_{4,0} + w_{4,1}x\right), ... \end{equation}Since these also have internal parameters each can once again take on a variety of shapes. Below we plot $4$ instances of such a function, where in each case its internal parameters have been set at random.
## This code cell will not be shown in the HTML version of this notebook
# build 4 instances of a composition basis: line and relu
x = np.linspace(-5,5,100)
fig = plt.figure(figsize = (10,3))
for m in range(1,5):
    # make basis element with random internal weights
    w_0 = np.random.randn(1)
    w_1 = np.random.randn(1)
    fm = np.maximum(0,w_0 + w_1*x)
    fm_table = np.stack((x,fm),axis = 1)

    # plot the current element
    ax = fig.add_subplot(1,4,m)
    ax.plot(fm_table[:,0],fm_table[:,1],c='r',linewidth = 3)
    ax.set_title('$f$ instance ' + str(m),fontsize = 18)

    # clean up plot
    ax.grid(True, which='both')
    ax.axhline(y=0, color='k')
    ax.axvline(x=0, color='k')
plt.show()
To handle higher dimensional input we simply take a linear combination of the input, passing the result through the nonlinear function. For example, an element $f_j$ for general $N$ dimensional input looks like the following using the relu function
\begin{equation} f_j\left(\mathbf{x}\right) = \text{max}\left(0,w_{j,0} + w_{j,1}x_1 + \cdots + w_{j,\,N}x_N\right). \end{equation}As with the lower dimensional single layer functions, each such function can take on a variety of different shapes based on how we tune its internal parameters. Below we show $4$ instances of such a function with $N=2$ dimensional input.
## This code cell will not be shown in the HTML version of this notebook
# generate input values
s = np.linspace(-2,2,100)
x_1,x_2 = np.meshgrid(s,s)

# build 4 relu basis element instances over N = 2 dimensional input
fig = plt.figure(num=None, figsize = (10,4), dpi=80, facecolor='w', edgecolor='k')
for m in range(4):
    ax1 = plt.subplot(1,4,m+1,projection = '3d')
    ax1.set_axis_off()

    # random weights
    w_0 = np.random.randn(1)
    w_1 = np.random.randn(1)
    w_2 = np.random.randn(1)
    w_3 = np.random.randn(1)

    # plot surface of the current instance
    f_m = w_3*np.maximum(0,w_0 + w_1*x_1 + w_2*x_2)
    ax1.plot_surface(x_1,x_2,f_m,alpha = 0.35,color = 'w',zorder = 3,edgecolor = 'k',linewidth=1,cstride = 10, rstride = 10)
    ax1.view_init(20,40)
    ax1.set_title('$f$ instance ' + str(m+1),fontsize = 18)
fig.subplots_adjust(left=0,right=1,bottom=0,top=1) # remove whitespace around 3d figure
plt.show()
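For general $N$ the same element needs only a dot product. Here is a minimal sketch (the function name is our own, not part of the library used above) of evaluating $f_j\left(\mathbf{x}\right) = \text{max}\left(0,w_{j,0} + w_{j,1}x_1 + \cdots + w_{j,N}x_N\right)$:

```python
import numpy as np

def relu_element(w_0, w):
    # f_j(x) = max(0, w_0 + w_1*x_1 + ... + w_N*x_N), with N = len(w)
    return lambda x: np.maximum(0.0, w_0 + np.dot(w, x))

# an N = 2 instance with hand-picked internal weights
f_j = relu_element(1.0, np.array([2.0, -1.0]))
print(f_j(np.array([1.0, 1.0])))   # max(0, 1 + 2 - 1) = 2.0
print(f_j(np.array([-2.0, 0.0])))  # max(0, 1 - 4) = 0.0
```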
## This code cell will not be shown in the HTML version of this notebook
# build 4 instances of a two-layer composition basis: tanh of a sum of tanh units
x = np.linspace(-5,5,100)
fig = plt.figure(figsize = (10,3))
for m in range(1,5):
    # make basis element: a sum of 10 random tanh units, passed through tanh
    fm = 0
    for j in range(10):
        w_0 = np.random.randn(1)
        w_1 = np.random.randn(1)
        w_3 = np.random.randn(1)
        fm += w_3*np.tanh(w_0 + w_1*x)
    w_2 = np.random.randn(1)
    fm = np.tanh(w_2 + fm)
    fm_table = np.stack((x,fm),axis = 1)

    # plot the current element
    ax = fig.add_subplot(1,4,m)
    ax.plot(fm_table[:,0],fm_table[:,1],c='r',linewidth = 3,zorder = 3)
    ax.set_title('$f$ instance ' + str(m),fontsize = 18)

    # clean up plot
    ax.grid(True, which='both')
    ax.axhline(y=0, color='k')
    ax.axvline(x=0, color='k')
plt.show()
To create a more flexible decision tree basis function we split each level of the stump again. This gives us a tree of depth $2$ (the first split gave us a stump, another phrase for which is a tree of depth $1$). We can look at this mathematically / figuratively as in the figure below.
This gives a basis element with four (potentially) distinct levels. Since the locations of the splits and the values of the levels can be set in many ways, each element of a depth $2$ tree basis has a good deal more flexibility than a stump. Below we illustrate $4$ instances of a depth $2$ tree.
We describe trees in significantly further detail in Chapter 14.
## This code cell will not be shown in the HTML version of this notebook
# build 4 instances of a depth 2 tree basis element: a sum of three random stumps
x = np.linspace(-5,5,100)
fig = plt.figure(figsize = (10,3))
for m in range(1,5):
    # make first stump
    w_0 = 0.1*np.random.randn(1)
    w_1 = 0.1*np.random.randn(1)
    w_2 = np.random.randn(1)
    fm = w_2*np.sign(w_0 + w_1*x)

    # make second stump
    w_0 = 0.1*np.random.randn(1)
    w_1 = 0.1*np.random.randn(1)
    w_2 = np.random.randn(1)
    gm = w_2*np.sign(w_0 + w_1*x)

    # make third stump
    w_0 = 0.1*np.random.randn(1)
    w_1 = 0.1*np.random.randn(1)
    w_2 = np.random.randn(1)
    bm = w_2*np.sign(w_0 + w_1*x)

    # sum the stumps
    fm += gm
    fm += bm
    fm_table = np.stack((x,fm),axis = 1)

    # plot the current element
    ax = fig.add_subplot(1,4,m)
    ax.scatter(fm_table[:,0],fm_table[:,1],c='r',s = 20,zorder = 3)
    ax.set_title('$f$ instance ' + str(m),fontsize = 18)

    # clean up plot
    ax.grid(True, which='both')
    ax.axhline(y=0, color='k')
    ax.axvline(x=0, color='k')
plt.show()
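The four levels of a depth $2$ tree can also be read off directly: one root split plus one further split per branch yields four leaf values. A minimal sketch for one-dimensional input (the function name, split points, and leaf values here are our own illustrative choices):

```python
import numpy as np

def depth2_tree(x, s_root, s_left, s_right, levels):
    # root split at s_root; each branch is split again, producing the four leaf levels
    out = np.empty_like(x, dtype = float)
    left = x < s_root
    out[left]  = np.where(x[left]  < s_left,  levels[0], levels[1])
    out[~left] = np.where(x[~left] < s_right, levels[2], levels[3])
    return out

# evaluate one instance: splits at 0 (root), -1 (left branch), 1 (right branch)
x = np.linspace(-2, 2, 9)
print(depth2_tree(x, 0.0, -1.0, 1.0, [-1.0, 0.5, 2.0, -0.5]))
```

Tuning such an element means adjusting both the three split locations and the four leaf values, which is exactly the extra flexibility described above.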
We show an analogous two-class classification example below. Here our 'near-perfect' nonlinear classification dataset consists of $P = BLAH$ points that can be separated perfectly by a circular boundary centered at the origin. We show the results of three fully trained models using various numbers of polynomial units (left panel), $\text{tanh}$ network units (middle panel), and stump units (right panel).
import sys
sys.path.append('../../')
import autograd.numpy as np
from mlrefined_libraries import nonlinear_superlearn_library as nonlib
datapath = '../../mlrefined_datasets/nonlinear_superlearn_datasets/'
csvname = datapath + 'perfect_circle_data.csv'
demo999 = nonlib.main_classification_comparison.Visualizer(csvname)
demo999.runs3 = demo999.run_trees(100,50)
demo999.runs1 = demo999.run_poly(5)
demo999.runs2 = demo999.run_net(50)
# animate
frames = 5
demo999.animate_comparisons(frames,pt_size = 20)
A perfect two-class classification dataset - here consisting of $P = 10,000$ input/output pairs - has discrete jumps, yet it too can be approximated arbitrarily closely using a combination of universal approximators, provided we tune their parameters by minimizing a more appropriate cost - here the Softmax. We use $B=100$ universal approximator units (a single layer of $\text{tanh}$ units, as in the cell below), and once again as you move the slider from left to right the parameters of the combination improve, so that the approximation looks more and more like the function $y$ itself.
## This code cell will not be shown in the HTML version of this notebook
# load in data
csvname = datapath + 'discrete_function.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:-1,:]
y = data[-1:,:]
# import the v1 library
mylib2 = nonlib.library_v1.superlearn_setup.Setup(x,y)
# choose features
mylib2.choose_features(name = 'multilayer_perceptron',layer_sizes = [1,100,1],activation = 'tanh')
# choose normalizer
mylib2.choose_normalizer(name = 'standard')
# choose cost
mylib2.choose_cost(name = 'softmax')
# fit an optimization
mylib2.fit(max_its = 10000,alpha_choice = 10**(-1))
# load up animator
demo2 = nonlib.run_animators.Visualizer(csvname)
# pluck out a sample of the weight history
num_frames = 100 # how many evenly spaced weights from the history to animate
# animate based on the sample weight history
demo2.animate_1d_regression(mylib2,num_frames,scatter = 'function',show_history = True)